Spam Ham Detection

Hayden Hoopes

The Spam Ham Detection dataset is a collection of text messages that have been labeled as either spam or ham. The dataset consists of a CSV file called spam.csv, which contains 5,572 text messages and their labels.

In this project, I will use the TextVectorization layer in Keras to prepare text data for binary classification on the Spam Ham Detection dataset. I will define and train a Keras model to classify text messages as either spam or ham.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('spam.csv', encoding='latin-1')

# Clean unnecessary columns and rename
df = df[['v1', 'v2']]
df = df.rename(columns={'v1': 'spam', 'v2': 'text'})
In [10]:
df['spam'] = df['spam'].map({'ham': 0, 'spam': 1})

labels = df['spam']
In [20]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df['text'], labels, test_size=0.3, random_state=1)

Baseline Accuracy

The baseline accuracy for this model is about 86.6% as seen below in the proportion of the majority class. That means that a classifier must at least be more accurate than 86.6% in order to be considered effective.

In [41]:
df['spam'].value_counts(normalize=True)
Out[41]:
0    0.865937
1    0.134063
Name: spam, dtype: float64
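
As a rough cross-check of the baseline, the sketch below computes the majority-class accuracy in plain Python. The counts of 4,825 ham and 747 spam messages are not stated in the notebook; they are an assumption derived from the 5,572 rows and the proportions printed above, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Assumed class counts: 0 = ham, 1 = spam (derived from the proportions above)
example_labels = [0] * 4825 + [1] * 747

majority_label, baseline = majority_baseline(example_labels)
print(majority_label, round(baseline, 4))
```

A classifier that always answers "ham" would reach this accuracy without learning anything, which is why it serves as the floor for the models below.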

Bag of Words Model

Below, a baseline bag of words model is constructed and evaluated. It only checks which terms are present in a message, and it sets a target for the sequence model that is applied later. This bag of words model uses a max vocabulary of 10,000 (10,000 features) and a single hidden dense layer with 32 nodes. The hidden layer therefore has 10,000 × 32 weights plus 32 bias terms, or 320,032 parameters, and the output layer with 1 node adds 32 weights and 1 bias term (33 parameters), for a total of 320,065 parameters.
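
The parameter count can be checked by hand. The following is a minimal sketch of the arithmetic for fully connected layers; `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_layer = dense_params(10_000, 32)  # 10,000 features into 32 nodes
output_layer = dense_params(32, 1)       # 32 hidden activations into 1 node
total_params = hidden_layer + output_layer

print(hidden_layer, output_layer, total_params)
```

These figures agree with the `Param #` column that `model.summary()` reports for the two dense layers.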

According to the Loss by Epoch graphic below, the model begins overfitting after only a few epochs: the validation loss reaches its minimum at epoch 3 (0.0900) and then climbs steadily to about 0.29 by epoch 50, while the training loss keeps falling. This could be because the model ignores the ordering of the terms in a message, and because each message contains only a small fraction of the vocabulary, which may make it easy for the model to fit patterns that don't really exist.

That said, the best model trained had an accuracy of about 98% on the validation set. That's clearly better than the 86.6% baseline, and it is a very good result assuming that the class proportions in the validation split are similar to those of the full dataset (the split above is not stratified).

In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the training messages and convert them to integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

sequences = tokenizer.texts_to_sequences(X_train)
In [35]:
from tensorflow.keras.layers import TextVectorization

# This vectorizer uses both unigrams and bigrams
text_vectorization = TextVectorization(max_tokens=10_000, ngrams=(1,2), output_mode='multi_hot', pad_to_max_tokens=False)
text_vectorization.adapt(X_train)
In [36]:
X_train_vectorized = text_vectorization(X_train)
X_val_vectorized = text_vectorization(X_val)
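
To make the `multi_hot` output with unigrams and bigrams more concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the toy vocabulary, the helper names, and the use of slot 0 as a catch-all for unknown terms are assumptions made for illustration.

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

def multi_hot(text, vocab):
    """Set position i to 1 if vocabulary term i occurs in the text."""
    index = {term: i for i, term in enumerate(vocab)}
    vector = [0] * len(vocab)
    for gram in ngrams(tokenize(text)):
        vector[index.get(gram, 0)] = 1  # unknown terms fall into slot 0
    return vector

# Hypothetical toy vocabulary; the notebook's real one has up to 10,000 entries
toy_vocab = ['[UNK]', 'click', 'this', 'link', 'this link', 'free']
toy_vector = multi_hot('Click this link!', toy_vocab)
print(toy_vector)
```

Each message becomes a fixed-length vector of zeros and ones, so word order survives only through the bigram entries.
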
In [39]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=10_000, hidden_dim=32):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation='relu')(inputs)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

model = get_model()
model.summary()
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 10000)]           0         
                                                                 
 dense_2 (Dense)             (None, 32)                320032    
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 320,065
Trainable params: 320,065
Non-trainable params: 0
_________________________________________________________________
In [40]:
callbacks = [keras.callbacks.ModelCheckpoint('bag_of_words.keras', save_best_only=True)]

history = model.fit(X_train_vectorized, y_train, validation_data=(X_val_vectorized, y_val), epochs=50, callbacks=callbacks)
Epoch 1/50
122/122 [==============================] - 1s 8ms/step - loss: 0.3299 - accuracy: 0.9236 - val_loss: 0.1541 - val_accuracy: 0.9737
Epoch 2/50
122/122 [==============================] - 1s 8ms/step - loss: 0.0995 - accuracy: 0.9800 - val_loss: 0.0937 - val_accuracy: 0.9797
Epoch 3/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0607 - accuracy: 0.9867 - val_loss: 0.0900 - val_accuracy: 0.9803
Epoch 4/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0456 - accuracy: 0.9892 - val_loss: 0.0915 - val_accuracy: 0.9791
Epoch 5/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0355 - accuracy: 0.9910 - val_loss: 0.1007 - val_accuracy: 0.9785
Epoch 6/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0345 - accuracy: 0.9926 - val_loss: 0.1028 - val_accuracy: 0.9785
Epoch 7/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0301 - accuracy: 0.9923 - val_loss: 0.1088 - val_accuracy: 0.9791
Epoch 8/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0283 - accuracy: 0.9933 - val_loss: 0.1157 - val_accuracy: 0.9797
Epoch 9/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0272 - accuracy: 0.9946 - val_loss: 0.1211 - val_accuracy: 0.9797
Epoch 10/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0219 - accuracy: 0.9951 - val_loss: 0.1251 - val_accuracy: 0.9791
Epoch 11/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0201 - accuracy: 0.9954 - val_loss: 0.1300 - val_accuracy: 0.9791
Epoch 12/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0167 - accuracy: 0.9962 - val_loss: 0.1367 - val_accuracy: 0.9791
Epoch 13/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0183 - accuracy: 0.9962 - val_loss: 0.1399 - val_accuracy: 0.9797
Epoch 14/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0150 - accuracy: 0.9964 - val_loss: 0.1431 - val_accuracy: 0.9797
Epoch 15/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0159 - accuracy: 0.9969 - val_loss: 0.1484 - val_accuracy: 0.9791
Epoch 16/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0146 - accuracy: 0.9969 - val_loss: 0.1512 - val_accuracy: 0.9791
Epoch 17/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0118 - accuracy: 0.9974 - val_loss: 0.1620 - val_accuracy: 0.9785
Epoch 18/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0118 - accuracy: 0.9974 - val_loss: 0.1700 - val_accuracy: 0.9797
Epoch 19/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0117 - accuracy: 0.9972 - val_loss: 0.1722 - val_accuracy: 0.9785
Epoch 20/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0102 - accuracy: 0.9977 - val_loss: 0.1730 - val_accuracy: 0.9785
Epoch 21/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0105 - accuracy: 0.9974 - val_loss: 0.1800 - val_accuracy: 0.9785
Epoch 22/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0092 - accuracy: 0.9979 - val_loss: 0.1802 - val_accuracy: 0.9785
Epoch 23/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0083 - accuracy: 0.9979 - val_loss: 0.1827 - val_accuracy: 0.9785
Epoch 24/50
122/122 [==============================] - 1s 8ms/step - loss: 0.0076 - accuracy: 0.9982 - val_loss: 0.1898 - val_accuracy: 0.9785
Epoch 25/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0065 - accuracy: 0.9987 - val_loss: 0.1949 - val_accuracy: 0.9785
Epoch 26/50
122/122 [==============================] - 1s 8ms/step - loss: 0.0056 - accuracy: 0.9987 - val_loss: 0.2033 - val_accuracy: 0.9797
Epoch 27/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0051 - accuracy: 0.9982 - val_loss: 0.2072 - val_accuracy: 0.9791
Epoch 28/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0051 - accuracy: 0.9990 - val_loss: 0.2145 - val_accuracy: 0.9791
Epoch 29/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0047 - accuracy: 0.9987 - val_loss: 0.2166 - val_accuracy: 0.9791
Epoch 30/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0059 - accuracy: 0.9985 - val_loss: 0.2233 - val_accuracy: 0.9791
Epoch 31/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0053 - accuracy: 0.9979 - val_loss: 0.2215 - val_accuracy: 0.9797
Epoch 32/50
122/122 [==============================] - 1s 6ms/step - loss: 0.0029 - accuracy: 0.9987 - val_loss: 0.2283 - val_accuracy: 0.9791
Epoch 33/50
122/122 [==============================] - 1s 6ms/step - loss: 0.0049 - accuracy: 0.9987 - val_loss: 0.2241 - val_accuracy: 0.9797
Epoch 34/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0030 - accuracy: 0.9992 - val_loss: 0.2370 - val_accuracy: 0.9779
Epoch 35/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0029 - accuracy: 0.9992 - val_loss: 0.2365 - val_accuracy: 0.9797
Epoch 36/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0027 - accuracy: 0.9995 - val_loss: 0.2439 - val_accuracy: 0.9791
Epoch 37/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0040 - accuracy: 0.9992 - val_loss: 0.2418 - val_accuracy: 0.9797
Epoch 38/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0036 - accuracy: 0.9990 - val_loss: 0.2466 - val_accuracy: 0.9791
Epoch 39/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.2593 - val_accuracy: 0.9779
Epoch 40/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 0.2601 - val_accuracy: 0.9785
Epoch 41/50
122/122 [==============================] - 1s 4ms/step - loss: 0.0026 - accuracy: 0.9992 - val_loss: 0.2711 - val_accuracy: 0.9773
Epoch 42/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9995 - val_loss: 0.2767 - val_accuracy: 0.9779
Epoch 43/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 0.2718 - val_accuracy: 0.9785
Epoch 44/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0017 - accuracy: 0.9997 - val_loss: 0.2678 - val_accuracy: 0.9797
Epoch 45/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0021 - accuracy: 0.9995 - val_loss: 0.2746 - val_accuracy: 0.9797
Epoch 46/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0020 - accuracy: 0.9995 - val_loss: 0.2804 - val_accuracy: 0.9785
Epoch 47/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0018 - accuracy: 0.9995 - val_loss: 0.2843 - val_accuracy: 0.9791
Epoch 48/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0019 - accuracy: 0.9995 - val_loss: 0.2789 - val_accuracy: 0.9785
Epoch 49/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0034 - accuracy: 0.9992 - val_loss: 0.2835 - val_accuracy: 0.9785
Epoch 50/50
122/122 [==============================] - 1s 7ms/step - loss: 0.0015 - accuracy: 0.9997 - val_loss: 0.2921 - val_accuracy: 0.9785
In [43]:
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
In [48]:
test_model = keras.models.load_model('bag_of_words.keras')
test_model.evaluate(X_val_vectorized, y_val)
53/53 [==============================] - 0s 1ms/step - loss: 0.0900 - accuracy: 0.9803
Out[48]:
[0.09002583473920822, 0.9802631735801697]
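
With `save_best_only=True`, `ModelCheckpoint` monitors `val_loss` by default, so the saved file corresponds to the epoch with the lowest validation loss. The sketch below locates that epoch from a list of losses. The values are copied from the first six epochs of the training log above; in the notebook itself they would come from `history.history['val_loss']`.

```python
# Validation loss for epochs 1-6, copied from the training log
val_loss = [0.1541, 0.0937, 0.0900, 0.0915, 0.1007, 0.1028]

# Index of the smallest loss; epochs are numbered from 1
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1

print(best_epoch, val_loss[best_index])
```

The loss of 0.0900 at epoch 3 matches the loss returned by `evaluate` on the reloaded model.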

Sequence Modeling

Next, sequence modeling is used to determine if the accuracy of detecting spam messages can be improved by taking into account the ordering of the words in a message rather than just considering the presence of words individually. This model keeps the max vocabulary of 10,000 and uses sequences of length 1,000, meaning that the model considers at most 1,000 tokens of any message; shorter messages are padded to that length and longer ones are truncated. The total number of parameters is 1,284,673.

The Loss by Epoch graph again shows that this model begins overfitting very early. The validation loss decreases from epoch 1 to epoch 2 (0.1079 to 0.0725) and then increases over the remaining epochs. This is surprising to me, since I expected the embedding layer to add dimensionality to the model that would help it recognize patterns that occur between related words. I also expected the model to need more training because of the bidirectional RNN layer, which I assumed would need more time to adapt to the particular words of this data set and backpropagate the updated weights through the model.

This model had a validation accuracy of 97.97%, which is only slightly worse than the previous model's 98.03%. This could be due to random chance. Since the difference between the validation accuracies of the two models is negligible, it is likely safe to say that either model would perform well in a real-world situation for classifying messages as spam or not.

In [51]:
text_vectorization = layers.TextVectorization(max_tokens=10_000, output_sequence_length=1_000, output_mode='int')
text_vectorization.adapt(X_train)

X_train_vectorized = text_vectorization(X_train)
X_val_vectorized = text_vectorization(X_val)
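
The `int` output mode turns each message into a fixed-length list of token ids. The sketch below imitates that behavior in plain Python under simplifying assumptions: a hypothetical toy vocabulary, whitespace tokenization only, id 0 for padding, and id 1 for out-of-vocabulary tokens.

```python
PAD_ID = 0  # fills the unused tail of short messages
OOV_ID = 1  # stands in for tokens missing from the vocabulary

def encode(text, vocab_index, seq_len):
    """Map tokens to ids, then truncate or pad to exactly seq_len entries."""
    ids = [vocab_index.get(token, OOV_ID) for token in text.lower().split()]
    ids = ids[:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))

# Hypothetical toy vocabulary; real ids start after the two reserved slots
toy_index = {'free': 2, 'prize': 3, 'now': 4}

padded = encode('Free prize inside now', toy_index, seq_len=6)
truncated = encode('Free prize inside now', toy_index, seq_len=2)
print(padded, truncated)
```

Because text messages are short, most of the 1,000 positions in each row are padding, which is what `mask_zero=True` in the embedding layer tells the RNN to skip.
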
In [58]:
inputs = keras.Input(shape=(None,), dtype='int64')
embedded = layers.Embedding(input_dim=10_000, output_dim=128, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.SimpleRNN(16))(embedded)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 128)         1280000   
                                                                 
 bidirectional (Bidirectiona  (None, 32)               4640      
 l)                                                              
                                                                 
 dropout_2 (Dropout)         (None, 32)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,284,673
Trainable params: 1,284,673
Non-trainable params: 0
_________________________________________________________________
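
The totals in the summary above can be reproduced by hand. This is a sketch of the arithmetic only; the helper names are hypothetical and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    """Input weights, recurrent weights, and one bias per unit."""
    return units * (input_dim + units + 1)

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding = embedding_params(10_000, 128)
bidirectional = 2 * simple_rnn_params(128, 16)  # forward and backward RNN
output_layer = dense_params(2 * 16, 1)          # both directions concatenated
total_params = embedding + bidirectional + output_layer

print(embedding, bidirectional, output_layer, total_params)
```

Nearly all of the parameters sit in the embedding table; the recurrent layer itself accounts for only 4,640 of them.
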
In [59]:
callbacks = [keras.callbacks.ModelCheckpoint('sequence_modeling.keras', save_best_only=True)]

history = model.fit(X_train_vectorized, y_train, validation_data=(X_val_vectorized, y_val), epochs=5, callbacks=callbacks)
Epoch 1/5
122/122 [==============================] - 37s 289ms/step - loss: 0.3195 - accuracy: 0.8872 - val_loss: 0.1079 - val_accuracy: 0.9713
Epoch 2/5
122/122 [==============================] - 36s 297ms/step - loss: 0.0650 - accuracy: 0.9856 - val_loss: 0.0725 - val_accuracy: 0.9797
Epoch 3/5
122/122 [==============================] - 35s 290ms/step - loss: 0.0286 - accuracy: 0.9936 - val_loss: 0.0752 - val_accuracy: 0.9761
Epoch 4/5
122/122 [==============================] - 35s 287ms/step - loss: 0.0144 - accuracy: 0.9954 - val_loss: 0.0828 - val_accuracy: 0.9791
Epoch 5/5
122/122 [==============================] - 35s 291ms/step - loss: 0.0067 - accuracy: 0.9982 - val_loss: 0.0990 - val_accuracy: 0.9707
In [60]:
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
In [61]:
test_model = keras.models.load_model('sequence_modeling.keras')
test_model.evaluate(X_val_vectorized, y_val)
53/53 [==============================] - 2s 41ms/step - loss: 0.0725 - accuracy: 0.9797
Out[61]:
[0.07249666750431061, 0.9796651005744934]

Testing the Model

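In this section, I would like to test the bag of words model using some made-up messages to see if it can differentiate between spam and non-spam messages.

The model seems to have a reasonable sense of which messages look like spam. The conversational messages score close to 0, while the messages containing "claim", "click", and "link" receive the highest spam probabilities (about 0.25 to 0.45), so these words may be important for identifying spam messages. None of the scores exceeds 0.5, however, so with a 0.5 cutoff every message would still be labeled as not spam.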

In [64]:
model = keras.models.load_model('bag_of_words.keras')

messages = [
    'Hey bro how r u doin?',
    'I just won a contest by clicking this link',
    'Your package could not be delivered, click this link',
    'Have you heard the news about social security insurance',
    'Claim your benefits today by clicking',
    'I am running late but will be there soon!'
]

text_vectorization = TextVectorization(max_tokens=10_000, ngrams=(1,2), output_mode='multi_hot', pad_to_max_tokens=False)
text_vectorization.adapt(X_train)

input_text = text_vectorization(messages)
output = model.predict(input_text)

print(output)
1/1 [==============================] - 0s 30ms/step
[[0.01035442]
 [0.26425034]
 [0.24674352]
 [0.05468433]
 [0.45137173]
 [0.00176566]]
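
The sigmoid outputs are probabilities, so a cutoff is needed to turn them into labels. The sketch below applies a 0.5 threshold, which is an assumption here (the notebook does not fix one), to the scores printed above and ranks the messages from most to least spam-like.

```python
messages = [
    'Hey bro how r u doin?',
    'I just won a contest by clicking this link',
    'Your package could not be delivered, click this link',
    'Have you heard the news about social security insurance',
    'Claim your benefits today by clicking',
    'I am running late but will be there soon!',
]

# Scores copied from the model output above
scores = [0.01035442, 0.26425034, 0.24674352, 0.05468433, 0.45137173, 0.00176566]

THRESHOLD = 0.5  # assumed cutoff; lowering it would flag more messages as spam
predicted = ['spam' if score >= THRESHOLD else 'ham' for score in scores]

# Rank messages from highest to lowest spam probability
ranked = sorted(zip(scores, messages), reverse=True)
for score, message in ranked:
    print(f'{score:.3f}  {message}')
```

At this threshold all six messages are labeled ham, even though the "claim" and "click this link" messages rank highest.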

Conclusion

This project gave me a lot of insight into the applications of different types of neural networks for NLP classification tasks. While I expected to see the sequence modeling approach perform better than the bag of words model, the differences in performance are extremely small and both models performed very well.

In my opinion, one of the reasons that the sequence modeling approach did not outperform bag of words vectorization is that text messages tend to be short, with a large variety of vocabulary, so ordering may not be as important as it is in formal writing. The use of abbreviations, slang, and combinations of these features with regular words likely makes it difficult for a small RNN like the one used in this file to spot clear relationships between the ordering of certain words and spam messages.

In the future, these models could perhaps be improved by increasing the vocabulary size of the vectorizer and by increasing the number of layers in either model. This may allow future models to better identify the patterns in text messages that distinguish spam from non-spam.